Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
We propose a max-pooling based loss function for training Long Short-Term
Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low
CPU, memory, and latency requirements. The max-pooling loss training can be
further guided by initializing with a cross-entropy loss trained network. A
posterior smoothing based evaluation approach is employed to measure keyword
spotting performance. Our experimental results show that LSTM models trained
using cross-entropy loss or max-pooling loss outperform a cross-entropy loss
trained baseline feed-forward Deep Neural Network (DNN). In addition,
a max-pooling loss trained LSTM with a randomly initialized network performs
better than a cross-entropy loss trained LSTM. Finally, the max-pooling loss
trained LSTM initialized with a cross-entropy pre-trained network shows the
best performance, yielding a relative reduction in the Area Under the Curve
(AUC) measure compared to the baseline feed-forward DNN.
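To make the max-pooling objective concrete, here is a minimal PyTorch-style sketch of a loss of this kind. The function name, tensor shapes, and the use of class 0 as the background label are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of a max-pooling loss for keyword spotting, assuming
# frame-level logits of shape (T, C) from an LSTM, a keyword class id
# `kw_id`, and a known keyword segment [start, end). Names are illustrative.
import torch
import torch.nn.functional as F

def max_pooling_loss(logits, kw_id, start, end):
    """Cross-entropy on background frames (label 0 assumed) plus
    cross-entropy on the single frame inside the keyword segment with
    the highest keyword posterior (the max-pooled frame)."""
    log_probs = F.log_softmax(logits, dim=-1)                    # (T, C)
    bg = torch.cat([log_probs[:start], log_probs[end:]], dim=0)  # background
    bg_loss = -bg[:, 0].mean() if bg.numel() > 0 else logits.new_zeros(())
    # Max-pooling: keep only the frame where the keyword posterior peaks.
    t_max = start + log_probs[start:end, kw_id].argmax()
    kw_loss = -log_probs[t_max, kw_id]
    return bg_loss + kw_loss
```

The intuition is that backpropagating only through the highest-scoring frame in the keyword segment lets the network learn to fire strongly once per keyword instead of being forced to fire on every frame.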
A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy
Acoustic Echo Cancellation (AEC) is essential for accurate recognition of
queries spoken to a smart speaker that is playing out audio. Previous work has
shown that a neural AEC model operating on log-mel spectral features (denoted
"logmel" hereafter) can greatly improve Automatic Speech Recognition (ASR)
accuracy when optimized with an auxiliary loss utilizing a pre-trained ASR
model encoder. In this paper, we develop a conformer-based waveform-domain
neural AEC model inspired by the "TasNet" architecture. The model is trained by
jointly optimizing Negative Scale-Invariant SNR (SISNR) and ASR losses on a
large speech dataset. On a realistic rerecorded test set, we find that
cascading a linear adaptive AEC and a waveform-domain neural AEC is very
effective, giving 56-59% word error rate (WER) reduction over the linear AEC
alone. On this test set, the 1.6M parameter waveform-domain neural AEC also
improves over a larger 6.5M parameter logmel-domain neural AEC model by 20-29%
in easy to moderate conditions. By operating on smaller frames, the waveform
neural model is able to perform better at smaller sizes and is better suited
for applications where memory is limited.
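As a reference for the training objective, below is a minimal sketch of the Negative Scale-Invariant SNR (SISNR) term using its standard definition; the auxiliary ASR encoder loss that the paper jointly optimizes is omitted, and the function name and epsilon value are assumptions.

```python
import torch

def neg_sisnr(est, ref, eps=1e-8):
    """Negative SI-SNR between estimated and reference waveforms of
    shape (..., T); lower is better, so it can be minimized directly."""
    # Remove the mean so the measure ignores any constant offset.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (scale-invariant target).
    dot = (est * ref).sum(dim=-1, keepdim=True)
    s_target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    sisnr = 10 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -sisnr.mean()
```

In a joint setup, this term would typically be combined with the ASR loss as a weighted sum, with the weight tuned on held-out data.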
A study of acoustic-to-articulatory inversion of speech by analysis-by-synthesis using chain matrices and the Maeda articulatory model
In this paper, a quantitative study of acoustic-to-articulatory inversion for vowel speech sounds by analysis-by-synthesis using the Maeda articulatory model is performed. For chain matrix calculation of vocal tract (VT) acoustics, the chain matrix derivatives with respect to the area function are calculated and used in a quasi-Newton method for optimizing articulatory trajectories. The cost function includes a distance measure between the natural and synthesized first three formants, as well as parameter regularization and continuity terms. Calibration of the Maeda model to two speakers, one male and one female, from the University of Wisconsin x-ray microbeam (XRMB) database using a cost function is discussed. Model adaptation includes scaling the overall VT and the pharyngeal region and modifying the outer VT outline using measured palate and pharyngeal traces. The inversion optimization is initialized by a fast search of an articulatory codebook, which was pruned using XRMB data to improve inversion results. Good agreement between estimated midsagittal VT outlines and measured XRMB tongue pellet positions was achieved for several vowels and diphthongs for the male speaker, with average pellet-VT outline distances around 0.15 cm, smooth articulatory trajectories, and less than 1% average error in the first three formants.
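To summarize the optimization described above, the per-frame cost can be written schematically as below; the relative formant-distance form, the squared-norm penalties, and the weights \lambda_r and \lambda_c are assumptions for illustration, since the abstract names the three terms without giving their exact form.

```latex
% Schematic inversion cost at frame t: formant match (first three formants)
% plus parameter regularization and trajectory continuity terms.
E(\mathbf{p}_t) = \sum_{i=1}^{3}
    \left( \frac{F_i(\mathbf{p}_t) - \hat{F}_i(t)}{\hat{F}_i(t)} \right)^{2}
  + \lambda_r \lVert \mathbf{p}_t \rVert^{2}
  + \lambda_c \lVert \mathbf{p}_t - \mathbf{p}_{t-1} \rVert^{2}
```

Here \mathbf{p}_t denotes the Maeda articulatory parameters at frame t, F_i(\mathbf{p}_t) the formants synthesized through the chain-matrix VT computation, and \hat{F}_i(t) the measured formants; a quasi-Newton method minimizes E using the chain-matrix derivatives with respect to the area function.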